<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>ArXiv CS.CV Papers (Image/Video Generation) - May 02, 2025</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/framer-motion/10.16.4/framer-motion.dev.js"></script>
    <!-- Example using Font Awesome (replace with your preferred icon library if needed) -->
    <link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.4.2/css/all.min.css">
    <style>
        @import url('https://fonts.googleapis.com/css2?family=Inter:wght@400;600;700&display=swap');

        :root {
            /* New Palette: Light, Clean, Futuristic with Teal/Aqua accents */
            --bg-color: #f8fafc; /* Tailwind slate-50 (Very Light Gray) */
            --card-bg-color: #ffffff; /* White */
            --text-color: #1e293b; /* Tailwind slate-800 (Dark Gray-Blue) */
            --text-muted-color: #64748b; /* Tailwind slate-500 (Medium Gray-Blue) */
            --header-color: #0f172a; /* Tailwind slate-900 (Very Dark Blue) */
            --highlight-primary: #14b8a6; /* Tailwind teal-500 */
            --highlight-secondary: #67e8f9; /* Tailwind cyan-300 */
            --border-color: #e2e8f0; /* Tailwind slate-200 (Light Gray) */
            --shadow-color: rgba(15, 23, 42, 0.08); /* Subtle shadow based on slate-900 */
        }

        body {
            background-color: var(--bg-color);
            color: var(--text-color);
            font-family: 'Inter', sans-serif;
            overflow-x: hidden; /* Prevent horizontal scroll */
            line-height: 1.6;
        }

        .bento-grid {
            display: grid;
            gap: 1.5rem; /* Tailwind gap-6 */
            grid-template-columns: 1fr; /* Force single column */
            padding-bottom: 4rem; /* Add padding at the bottom */
        }

        .bento-item {
            /* Apply semi-transparent white background and blur */
            background-color: rgba(255, 255, 255, 0.7); /* White with 70% opacity */
            backdrop-filter: blur(10px); /* Apply blur effect */
            -webkit-backdrop-filter: blur(10px); /* Safari prefix */
            border-radius: 1rem; /* Slightly larger radius */
            padding: 1.75rem; /* Slightly more padding */
            border: 1px solid rgba(226, 232, 240, 0.5); /* Lighter border with transparency */
            box-shadow: 0 4px 12px var(--shadow-color);
            transition: transform 0.3s ease-out, box-shadow 0.3s ease-out, background-color 0.3s ease-out;
            overflow: hidden; /* Ensure content doesn't overflow */
            position: relative; /* For potential pseudo-elements */
        }

        /* Removed ::before pseudo-element for a cleaner look */


        .bento-item:hover {
            transform: translateY(-6px);
            box-shadow: 0 10px 20px var(--shadow-color), 0 4px 8px rgba(15, 23, 42, 0.06); /* Adjusted hover shadow */
        }

        .paper-title {
            font-size: 1.125rem; /* Tailwind text-lg */
            font-weight: 600; /* Tailwind font-semibold */
            color: var(--highlight-primary); /* Use new primary highlight */
            margin-bottom: 0.75rem; /* Tailwind mb-3 */
            line-height: 1.4;
        }

        .paper-summary {
            font-size: 0.875rem; /* Tailwind text-sm */
            color: var(--text-muted-color);
            margin-bottom: 1.25rem; /* Tailwind mb-5 */
            line-height: 1.6;
        }

        .paper-link {
            display: inline-flex; /* Use flex for icon alignment */
            align-items: center;
            font-size: 0.875rem; /* Tailwind text-sm */
            font-weight: 600;
            color: var(--highlight-primary);
            text-decoration: none;
            padding: 0.5rem 1rem; /* Add padding */
            border-radius: 0.5rem; /* Slightly rounder */
            background-color: rgba(20, 184, 166, 0.08); /* Subtle teal background */
            border: 1px solid rgba(20, 184, 166, 0.2);
            transition: background-color 0.3s ease, color 0.3s ease, transform 0.2s ease;
        }

        .paper-link i {
            margin-right: 0.5rem; /* Tailwind mr-2 */
            transition: transform 0.3s ease;
        }

        .paper-link:hover {
            background-color: rgba(20, 184, 166, 0.15);
            color: #0d9488; /* Darker teal on hover */
            transform: translateY(-1px);
        }
        .paper-link:hover i {
             transform: translateX(2px);
        }

        .paper-authors {
            font-size: 0.75rem; /* Tailwind text-xs */
            color: var(--text-muted-color);
            margin-top: 1rem; /* Tailwind mt-4 */
            font-style: italic;
        }

        .header {
            text-align: center;
            margin-bottom: 3rem; /* Tailwind mb-12 */
            padding-top: 3rem; /* Tailwind pt-12 */
        }

        .header h1 {
            font-size: 2.5rem; /* Tailwind text-4xl */
            font-weight: 700; /* Tailwind font-bold */
            color: var(--header-color);
            letter-spacing: -0.025em; /* Tailwind tracking-tight */
            margin-bottom: 0.5rem;
            /* Optional: Add a subtle text gradient */
            /* background: linear-gradient(90deg, var(--highlight-primary), var(--highlight-secondary)); */
            /* -webkit-background-clip: text; */
            /* -webkit-text-fill-color: transparent; */
        }

        .header p {
            font-size: 1.125rem; /* Tailwind text-lg */
            color: var(--text-muted-color);
            margin-top: 0.5rem; /* Tailwind mt-2 */
            max-width: 600px;
            margin-left: auto;
            margin-right: auto;
        }

        .footer {
            text-align: center;
            color: var(--text-muted-color);
            font-size: 0.875rem; /* Tailwind text-sm */
            padding-top: 2rem;
            padding-bottom: 2rem; /* Tailwind py-8 */
            border-top: 1px solid var(--border-color);
            margin-top: 4rem;
        }

        /* Simple line graphic element (optional) */
        .line-graphic {
            height: 1px; /* Thinner line */
            background: linear-gradient(90deg, rgba(20, 184, 166, 0), var(--highlight-primary), rgba(20, 184, 166, 0));
            opacity: 0.6;
            margin: 1.5rem 0; /* Adjust margin */
        }

        /* Framer Motion requires the script, styles enhance appearance */
        [data-motion-element] {
             /* Base styles for elements animated by Framer Motion */
        }

        .paper-tldr {
            font-size: 0.95rem; /* Slightly bigger than summary */
            color: #475569; /* Tailwind slate-600 (slightly darker than summary) */
            margin-top: 0.75rem; /* Tailwind mt-3 */
            margin-bottom: 0.75rem; /* Tailwind mb-3 */
            /* font-style: italic; */
            font-weight: bold;
        }

        .paper-rating {
            margin-top: 1rem; /* Tailwind mt-4 */
            margin-bottom: 1rem; /* Tailwind mb-4 */
            color: #f59e0b; /* Tailwind amber-500 */
        }

        .paper-rating i {
            margin-right: 0.125rem; /* Tailwind mr-0.5 */
        }

        /* Apply consistent star color to sub-ratings */
        .paper-sub-ratings .rating-item i {
            color: #f59e0b; /* Match overall rating star color (amber-500) */
            margin-right: 0.125rem; /* Consistent spacing */
        }

    </style>
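    <!--
      Note: Framer Motion is a React library, so the <motion.div> elements below
      (with their stringified props) will not animate in a plain HTML page. The
      script below is a hedged fallback sketch, not part of the original page:
      it replays the intended entrance animation on each [data-motion-element]
      card using IntersectionObserver and the Web Animations API.
    -->
    <script>
        document.addEventListener('DOMContentLoaded', () => {
            // Fade/slide/scale each card in once 20% of it enters the viewport,
            // mirroring whileInView="{ opacity: 1, y: 0, scale: 1 }".
            const observer = new IntersectionObserver((entries) => {
                entries.forEach((entry) => {
                    if (!entry.isIntersecting) return;
                    // initial { opacity: 0, y: 50, scale: 0.9 } -> final { opacity: 1, y: 0, scale: 1 }
                    entry.target.animate(
                        [
                            { opacity: 0, transform: 'translateY(50px) scale(0.9)' },
                            { opacity: 1, transform: 'translateY(0) scale(1)' }
                        ],
                        { duration: 500, easing: 'ease-out', fill: 'forwards' }
                    );
                    observer.unobserve(entry.target); // viewport: { once: true }
                });
            }, { threshold: 0.2 }); // viewport: { amount: 0.2 }
            document.querySelectorAll('[data-motion-element]').forEach((el) => observer.observe(el));
        });
    </script>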
</head>
<body class="container mx-auto px-4 antialiased">

    <motion.div
        initial="{ opacity: 0, y: -30 }"
        animate="{ opacity: 1, y: 0 }"
        transition="{ duration: 0.6, ease: 'easeOut' }"
        class="header"
        data-motion-element
    >
        <h1>AIGC Daily Papers</h1>
        <p>Daily papers related to Image/Video/Multimodal Generation from cs.CV</p>
        <p>May 02, 2025</p>
        <div class="line-graphic mt-4 mb-8 mx-auto w-1/4"></div> <!-- Added line graphic -->
    </motion.div>

    <div class="bento-grid" id="paper-grid">
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }"
            transition="{ duration: 0.5, delay: 0.0, ease: 'easeOut' }"
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">T2VPhysBench: A First-Principles Benchmark for Physical Consistency in Text-to-Video Generation</h2>
            <p class="paper-summary">Text-to-video generative models have made significant strides in recent
years, producing high-quality videos that excel in both aesthetic appeal and
accurate instruction following, and have become central to digital art creation
and user engagement online. Yet, despite these advancements, their ability to
respect fundamental physical laws remains largely untested: many outputs still
violate basic constraints such as rigid-body collisions, energy conservation,
and gravitational dynamics, resulting in unrealistic or even misleading
content. Existing physical-evaluation benchmarks typically rely on automatic,
pixel-level metrics applied to simplistic, life-scenario prompts, and thus
overlook both human judgment and first-principles physics. To fill this gap, we
introduce <strong>T2VPhysBench</strong>, a first-principles benchmark that
systematically evaluates whether state-of-the-art text-to-video systems, both
open-source and commercial, obey twelve core physical laws including Newtonian
mechanics, conservation principles, and phenomenological effects. Our benchmark
employs a rigorous human evaluation protocol and includes three targeted
studies: (1) an overall compliance assessment showing that all models score
below 0.60 on average in each law category; (2) a prompt-hint ablation
revealing that even detailed, law-specific hints fail to remedy physics
violations; and (3) a counterfactual robustness test demonstrating that models
often generate videos that explicitly break physical rules when so instructed.
The results expose persistent limitations in current architectures and offer
concrete insights for guiding future research toward truly physics-aware video
generation.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: The paper introduces T2VPhysBench, a new benchmark for evaluating the physical consistency of text-to-video models, revealing significant limitations in their adherence to fundamental physical laws.</p>
            
            
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(10/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                <span class="text-xs text-gray-500 ml-1">(9/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00337v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf"></i> Read Paper
            </a>
            
            <p class="paper-authors">Authors: Xuyang Guo, Jiayan Huo, Zhenmei Shi, Zhao Song, Jiahao Zhang, Jiale Zhao</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }"
            transition="{ duration: 0.5, delay: 0.05, ease: 'easeOut' }"
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Controllable Weather Synthesis and Removal with Video Diffusion Models</h2>
            <p class="paper-summary">Generating realistic and controllable weather effects in videos is valuable
for many applications. Physics-based weather simulation requires precise
reconstructions that are hard to scale to in-the-wild videos, while current
video editing often lacks realism and control. In this work, we introduce
WeatherWeaver, a video diffusion model that synthesizes diverse weather effects
-- including rain, snow, fog, and clouds -- directly into any input video
without the need for 3D modeling. Our model provides precise control over
weather effect intensity and supports blending various weather types, ensuring
both realism and adaptability. To overcome the scarcity of paired training
data, we propose a novel data strategy combining synthetic videos, generative
image editing, and auto-labeled real-world videos. Extensive evaluations show
that our method outperforms state-of-the-art methods in weather simulation and
removal, providing high-quality, physically plausible, and
scene-identity-preserving results over various real-world videos.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: The paper introduces WeatherWeaver, a video diffusion model that synthesizes and removes controllable weather effects in videos, using a novel data strategy to overcome the scarcity of paired training data.</p>
            
            
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00704v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf"></i> Read Paper
            </a>
            
            <p class="paper-authors">Authors: Chih-Hao Lin, Zian Wang, Ruofan Liang, Yuxuan Zhang, Sanja Fidler, Shenlong Wang, Zan Gojcic</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }"
            transition="{ duration: 0.5, delay: 0.1, ease: 'easeOut' }"
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">T2I-R1: Reinforcing Image Generation with Collaborative Semantic-level and Token-level CoT</h2>
            <p class="paper-summary">Recent advancements in large language models have demonstrated how
chain-of-thought (CoT) and reinforcement learning (RL) can improve performance.
However, applying such reasoning strategies to the visual generation domain
remains largely unexplored. In this paper, we present T2I-R1, a novel
reasoning-enhanced text-to-image generation model, powered by RL with a
bi-level CoT reasoning process. Specifically, we identify two levels of CoT
that can be utilized to enhance different stages of generation: (1) the
semantic-level CoT for high-level planning of the prompt and (2) the
token-level CoT for low-level pixel processing during patch-by-patch
generation. To better coordinate these two levels of CoT, we introduce
BiCoT-GRPO with an ensemble of generation rewards, which seamlessly optimizes
both generation CoTs within the same training step. By applying our reasoning
strategies to the baseline model, Janus-Pro, we achieve superior performance
with 13% improvement on T2I-CompBench and 19% improvement on the WISE
benchmark, even surpassing the state-of-the-art model FLUX.1. Code is available
at: https://github.com/CaraJ7/T2I-R1</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: The paper introduces T2I-R1, a text-to-image generation model that uses reinforcement learning with a bi-level chain-of-thought (CoT) reasoning process to improve performance, achieving state-of-the-art results on the T2I-CompBench and WISE benchmarks.</p>
            
            
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00703v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf"></i> Read Paper
            </a>
            
            <p class="paper-authors">Authors: Dongzhi Jiang, Ziyu Guo, Renrui Zhang, Zhuofan Zong, Hao Li, Le Zhuo, Shilin Yan, Pheng-Ann Heng, Hongsheng Li</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }"
            transition="{ duration: 0.5, delay: 0.15, ease: 'easeOut' }"
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">KeySync: A Robust Approach for Leakage-free Lip Synchronization in High Resolution</h2>
            <p class="paper-summary">Lip synchronization, known as the task of aligning lip movements in an
existing video with new input audio, is typically framed as a simpler variant
of audio-driven facial animation. However, as well as suffering from the usual
issues in talking head generation (e.g., temporal consistency), lip
synchronization presents significant new challenges such as expression leakage
from the input video and facial occlusions, which can severely impact
real-world applications like automated dubbing, but are often neglected in
existing works. To address these shortcomings, we present KeySync, a two-stage
framework that succeeds in solving the issue of temporal consistency, while
also incorporating solutions for leakage and occlusions using a carefully
designed masking strategy. We show that KeySync achieves state-of-the-art
results in lip reconstruction and cross-synchronization, improving visual
quality and reducing expression leakage according to LipLeak, our novel leakage
metric. Furthermore, we demonstrate the effectiveness of our new masking
approach in handling occlusions and validate our architectural choices through
several ablation studies. Code and model weights can be found at
https://antonibigata.github.io/KeySync.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: KeySync is a two-stage framework that achieves state-of-the-art lip synchronization by addressing temporal consistency, expression leakage, and occlusions with a novel masking strategy and leakage metric.</p>
            
            
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00497v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf"></i> Read Paper
            </a>
            
            <p class="paper-authors">Authors: Antoni Bigata, Rodrigo Mira, Stella Bounareli, Michał Stypułkowski, Konstantinos Vougioukas, Stavros Petridis, Maja Pantic</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }"
            transition="{ duration: 0.5, delay: 0.2, ease: 'easeOut' }"
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers</h2>
            <p class="paper-summary">We present JointDiT, a diffusion transformer that models the joint
distribution of RGB and depth. By leveraging the architectural benefit and
outstanding image prior of the state-of-the-art diffusion transformer, JointDiT
not only generates high-fidelity images but also produces geometrically
plausible and accurate depth maps. This solid joint distribution modeling is
achieved through two simple yet effective techniques that we propose, i.e.,
adaptive scheduling weights, which depend on the noise levels of each modality,
and the unbalanced timestep sampling strategy. With these techniques, we train
our model across all noise levels for each modality, enabling JointDiT to
naturally handle various combinatorial generation tasks, including joint
generation, depth estimation, and depth-conditioned image generation by simply
controlling the timestep of each branch. JointDiT demonstrates outstanding
joint generation performance. Furthermore, it achieves comparable results in
depth estimation and depth-conditioned image generation, suggesting that joint
distribution modeling can serve as a replaceable alternative to conditional
generation. The project page is available at
https://byungki-k.github.io/JointDiT/.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: JointDiT is a diffusion transformer that jointly models RGB and depth using adaptive scheduling weights and unbalanced timestep sampling, achieving strong performance in joint generation, depth estimation, and depth-conditioned image generation.</p>
            
            
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00482v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Kwon Byung-Ki, Qi Dai, Lee Hyoseok, Chong Luo, Tae-Hyun Oh</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.25, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models</h2>
            <p class="paper-summary">The rapid evolution of social media has provided enhanced communication
channels for individuals to create online content, enabling them to express
their thoughts and opinions. Multimodal memes, often utilized for playful or
humorous expressions with visual and textual elements, are sometimes misused to
disseminate hate speech against individuals or groups. While the detection of
hateful memes is well-researched, developing effective methods to transform
hateful content in memes remains a significant challenge. Leveraging the
powerful generation and reasoning capabilities of Vision-Language Models
(VLMs), we address the tasks of detecting and mitigating hateful content. This
paper presents two key contributions: first, a definition-guided prompting
technique for detecting hateful memes, and second, a unified framework for
mitigating hateful content in memes, named UnHateMeme, which works by replacing
hateful textual and/or visual components. With our definition-guided prompts,
VLMs achieve impressive performance on the hateful meme detection task.
Furthermore, our UnHateMeme framework, integrated with VLMs, demonstrates a
strong capability to convert hateful memes into non-hateful forms that meet
human-level criteria for hate speech and maintain multimodal coherence between
image and text. Through empirical experiments, we show the effectiveness of
state-of-the-art pretrained VLMs such as LLaVA, Gemini and GPT-4o on the
proposed tasks, providing a comprehensive analysis of their respective
strengths and limitations for these tasks. This paper aims to shed light on
important applications of VLMs for ensuring safe and respectful online
environments.</p>
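As a rough illustration of definition-guided prompting, the sketch below composes a prompt from an explicit hate-speech definition and the meme's overlaid text; the definition wording and output format are assumptions, not the paper's exact prompt (the meme image would be passed to the VLM separately):

```python
# Illustrative definition-guided prompt for hateful-meme detection with a
# VLM; the definition text and answer format are assumptions.

HATE_DEFINITION = (
    "Hate speech is content that attacks or demeans a person or group "
    "based on attributes such as race, religion, ethnicity, gender, "
    "disability, or sexual orientation."
)

def build_detection_prompt(meme_text: str) -> str:
    """Pair the explicit definition with the meme's overlaid text so the
    VLM judges the meme against the definition rather than its own
    implicit notion of hatefulness."""
    return (
        f"Definition: {HATE_DEFINITION}\n\n"
        f'Meme text: "{meme_text}"\n\n'
        "Considering both the image and the text under the definition "
        "above, answer 'hateful' or 'not hateful'."
    )
```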
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a method for detecting and mitigating hateful content in multimodal memes using vision-language models (vlms), including a definition-guided prompting technique for detection and a framework (unhatememe) for transforming hateful memes into non-hateful ones by replacing textual/visual components.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种使用视觉-语言模型（vlms）检测和缓解多模态表情包中仇恨内容的方法，包括用于检测的定义引导提示技术和一个框架（unhatememe），通过替换文本/视觉组件将仇恨表情包转换为非仇恨表情包。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00150v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Minh-Hao Van, Xintao Wu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.30000000000000004, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Eye2Eye: A Simple Approach for Monocular-to-Stereo Video Synthesis</h2>
            <p class="paper-summary">The rising popularity of immersive visual experiences has increased interest
in stereoscopic 3D video generation. Despite significant advances in video
synthesis, creating 3D videos remains challenging due to the relative scarcity
of 3D video data. We propose a simple approach for transforming a text-to-video
generator into a video-to-stereo generator. Given an input video, our framework
automatically produces the video frames from a shifted viewpoint, enabling a
compelling 3D effect. Prior and concurrent approaches for this task typically
operate in multiple phases, first estimating video disparity or depth, then
warping the video accordingly to produce a second view, and finally inpainting
the disoccluded regions. This approach inherently fails when the scene involves
specular surfaces or transparent objects. In such cases, single-layer disparity
estimation is insufficient, resulting in artifacts and incorrect pixel shifts
during warping. Our work bypasses these restrictions by directly synthesizing
the new viewpoint, avoiding any intermediate steps. This is achieved by
leveraging a pre-trained video model's priors on geometry, object materials,
optics, and semantics, without relying on external geometry models or manually
disentangling geometry from the synthesis process. We demonstrate the
advantages of our approach in complex, real-world scenarios featuring diverse
object materials and compositions. See videos on
https://video-eye2eye.github.io</p>
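The multi-phase baseline that the paper bypasses can be illustrated with a toy single-layer disparity warp on one scanline; the unfilled positions it leaves are the disocclusions that would need inpainting. All names and data here are illustrative, not from the paper:

```python
# Toy single-layer disparity warp, the kind of intermediate step the
# multi-phase baselines use and Eye2Eye avoids. A 1-D list stands in for
# an image row.

def warp_scanline(row, disparity, hole=None):
    """Shift row[i] to index i + disparity[i]; target positions that no
    source pixel reaches remain holes (disocclusions)."""
    out = [hole] * len(row)
    for i, (v, d) in enumerate(zip(row, disparity)):
        j = i + d
        if 0 <= j < len(out):
            out[j] = v  # later pixels overwrite earlier ones on collision
    return out
```

Specular or transparent surfaces break this model because a single disparity per pixel cannot represent two overlaid depths, which is the failure mode the abstract describes.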
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper proposes a novel "eye2eye" approach for monocular-to-stereo video synthesis by directly synthesizing the second viewpoint using a pre-trained video model, bypassing explicit depth or disparity estimation and mitigating artifacts in complex scenes.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文提出了一种名为 “eye2eye” 的新方法，用于从单目视频合成立体视频，它通过直接使用预训练的视频模型来合成第二个视角，绕过了显式的深度或视差估计，并减少了复杂场景中的伪影。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(8/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00135v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Michal Geyer, Omer Tov, Linyi Jin, Richard Tucker, Inbar Mosseri, Tali Dekel, Noah Snavely</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.35000000000000003, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">RayZer: A Self-supervised Large View Synthesis Model</h2>
            <p class="paper-summary">We present RayZer, a self-supervised multi-view 3D Vision model trained
without any 3D supervision, i.e., camera poses and scene geometry, while
exhibiting emerging 3D awareness. Concretely, RayZer takes unposed and
uncalibrated images as input, recovers camera parameters, reconstructs a scene
representation, and synthesizes novel views. During training, RayZer relies
solely on its self-predicted camera poses to render target views, eliminating
the need for any ground-truth camera annotations and allowing RayZer to be
trained with 2D image supervision. The emerging 3D awareness of RayZer is
attributed to two key factors. First, we design a self-supervised framework,
which achieves 3D-aware auto-encoding of input images by disentangling camera
and scene representations. Second, we design a transformer-based model in which
the only 3D prior is the ray structure, connecting camera, pixel, and scene
simultaneously. RayZer demonstrates novel view synthesis performance comparable
to or even better than that of "oracle" methods that rely on pose annotations in
both training and testing. Project: https://hwjiang1510.github.io/RayZer/</p>
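The ray structure that connects camera, pixel, and scene can be sketched with a standard pinhole ray parameterization; this is a generic illustration under assumed intrinsics, not RayZer's implementation:

```python
import math

# Minimal sketch of the ray prior: each pixel (u, v) maps to an origin
# plus a unit direction using the (self-predicted) focal lengths and
# principal point. All parameter names are illustrative.

def pixel_ray(u, v, fx, fy, cx, cy):
    """Return (origin, unit direction) of the camera-space ray through
    pixel (u, v) of a pinhole camera; the origin is the camera center."""
    d = ((u - cx) / fx, (v - cy) / fy, 1.0)
    n = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
    return (0.0, 0.0, 0.0), tuple(c / n for c in d)
```

Because rays are a function of the predicted camera parameters, rendering through them lets photometric error on target views supervise the pose predictions without ground-truth annotations.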
            
            <p class="paper-tldr"><strong>TLDR</strong>: rayzer is a self-supervised model that synthesizes novel views from unposed images, recovering camera parameters and reconstructing scenes without 3d supervision, achieving comparable or better performance than methods requiring pose annotations.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: rayzer是一个自监督模型，可以从无姿态图像中合成新的视图，在没有3d监督的情况下恢复相机参数和重建场景，并且实现了与需要姿势注释的方法相比具有竞争力甚至更好的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00702v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, Georgios Pavlakos</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.4, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">GuideSR: Rethinking Guidance for One-Step High-Fidelity Diffusion-Based Super-Resolution</h2>
            <p class="paper-summary">In this paper, we propose GuideSR, a novel single-step diffusion-based image
super-resolution (SR) model specifically designed to enhance image fidelity.
Existing diffusion-based SR approaches typically adapt pre-trained generative
models to image restoration tasks by adding extra conditioning on a
VAE-downsampled representation of the degraded input, which often compromises
structural fidelity. GuideSR addresses this limitation by introducing a
dual-branch architecture comprising: (1) a Guidance Branch that preserves
high-fidelity structures from the original-resolution degraded input, and (2) a
Diffusion Branch, which employs a pre-trained latent diffusion model to enhance
perceptual quality. Unlike conventional conditioning mechanisms, our Guidance
Branch features a tailored structure for image restoration tasks, combining
Full Resolution Blocks (FRBs) with channel attention and an Image Guidance
Network (IGN) with guided attention. By embedding detailed structural
information directly into the restoration pipeline, GuideSR produces sharper
and more visually consistent results. Extensive experiments on benchmark
datasets demonstrate that GuideSR achieves state-of-the-art performance while
maintaining the low computational cost of single-step approaches, with up to
1.39dB PSNR gain on challenging real-world datasets. Our approach consistently
outperforms existing methods across various reference-based metrics including
PSNR, SSIM, LPIPS, DISTS and FID, further representing a practical advancement
for real-world image restoration.</p>
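As a toy illustration of the channel attention used in the guidance branch's Full Resolution Blocks, the sketch below gates each channel by a sigmoid of its mean activation; this stands in for the learned attention weights the paper actually trains, so treat it as an assumption, not the method:

```python
import math

# Toy channel attention: re-weight each channel by a gate derived from
# its global average activation (a simplification of learned
# squeeze-and-excite style weights).

def channel_attention(feats):
    """feats: list of channels, each a flat list of activations.
    Returns the channels scaled by sigmoid(channel mean)."""
    out = []
    for ch in feats:
        gate = 1.0 / (1.0 + math.exp(-sum(ch) / len(ch)))
        out.append([v * gate for v in ch])
    return out
```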
            
            <p class="paper-tldr"><strong>TLDR</strong>: guidesr introduces a novel single-step diffusion-based image super-resolution model that uses a dual-branch architecture to enhance image fidelity by preserving high-fidelity structures from the original-resolution degraded input and enhancing perceptual quality using a pre-trained latent diffusion model.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: guidesr 提出了一种新颖的单步扩散图像超分辨率模型，该模型使用双分支架构来增强图像保真度，通过保留原始分辨率降级输入中的高保真度结构，并使用预训练的潜在扩散模型来增强感知质量。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00687v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Aditya Arora, Zhengzhong Tu, Yufei Wang, Ruizheng Bai, Jian Wang, Sizhuo Ma</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.45, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Efficient Neural Video Representation with Temporally Coherent Modulation</h2>
            <p class="paper-summary">Implicit neural representations (INR) has found successful applications
across diverse domains. To employ INR in real-life, it is important to speed up
training. In the field of INR for video applications, the state-of-the-art
approach employs grid-type parametric encoding and successfully achieves a
faster encoding speed in comparison to its predecessors. However, the grid
usage, which does not consider the video's dynamic nature, leads to redundant
use of trainable parameters. As a result, it has significantly lower parameter
efficiency and higher bitrate compared to NeRV-style methods that do not use a
parametric encoding. To address the problem, we propose Neural Video
representation with Temporally coherent Modulation (NVTM), a novel framework
that can capture dynamic characteristics of video. By decomposing the
spatio-temporal 3D video data into a set of 2D grids with flow information,
NVTM enables learning video representations rapidly and using parameters
efficiently. Our framework processes temporally corresponding pixels at once,
achieving the fastest encoding speed at a reasonable video quality, over 3
times faster than NeRV-style methods. It also achieves an average of 1.54dB/0.019
improvements in PSNR/LPIPS on UVG (Dynamic) (even with 10% fewer parameters)
and an average of 1.84dB/0.013 improvements in PSNR/LPIPS on MCL-JCV (Dynamic),
compared to previous grid-type works. By expanding this to compression tasks,
we demonstrate comparable performance to video compression standards (H.264,
HEVC) and recent INR approaches for video compression. Additionally, we perform
extensive experiments demonstrating the superior performance of our algorithm
across diverse tasks, encompassing super resolution, frame interpolation and
video inpainting. Project page is https://sujiikim.github.io/NVTM/.</p>
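The idea of processing temporally corresponding pixels at once can be sketched as mapping pixels through flow back to a shared canonical grid cell, so one trainable cell serves all frames; the constant per-frame flow used below is an assumption for illustration only:

```python
# Toy sketch of temporally coherent grid sharing: pixels in different
# frames that correspond under a flow field map to the same cell of a
# shared 2D feature grid. A constant per-frame flow (dx, dy) is assumed.

def canonical_coord(x, y, t, flow):
    """Map pixel (x, y) at frame t to its canonical-frame coordinate by
    undoing t steps of the flow."""
    dx, dy = flow
    return (x - dx * t, y - dy * t)

def shared_grid_key(x, y, t, flow, cell=1.0):
    """Quantize the canonical coordinate into a grid cell index, so
    temporally corresponding pixels hit the same trainable cell."""
    cx, cy = canonical_coord(x, y, t, flow)
    return (int(cx // cell), int(cy // cell))
```

Sharing one 2D cell across frames is what removes the parameter redundancy of a grid that stores every frame independently.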
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces a novel neural video representation framework (nvtm) that speeds up training and improves parameter efficiency by using temporally coherent modulation, outperforming existing methods in video representation tasks.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种新型神经视频表示框架（nvtm），它通过使用时间相干调制来加速训练并提高参数效率，在视频表示任务中优于现有方法。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00335v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Seungjun Shin, Suji Kim, Dokwan Oh</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.5, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Quaternion Wavelet-Conditioned Diffusion Models for Image Super-Resolution</h2>
            <p class="paper-summary">Image Super-Resolution is a fundamental problem in computer vision with broad
applications spacing from medical imaging to satellite analysis. The ability to
reconstruct high-resolution images from low-resolution inputs is crucial for
enhancing downstream tasks such as object detection and segmentation. While
deep learning has significantly advanced SR, achieving high-quality
reconstructions with fine-grained details and realistic textures remains
challenging, particularly at high upscaling factors. Recent approaches
leveraging diffusion models have demonstrated promising results, yet they often
struggle to balance perceptual quality with structural fidelity. In this work,
we introduce ResQu a novel SR framework that integrates a quaternion wavelet
preprocessing framework with latent diffusion models, incorporating a new
quaternion wavelet- and time-aware encoder. Unlike prior methods that simply
apply wavelet transforms within diffusion models, our approach enhances the
conditioning process by exploiting quaternion wavelet embeddings, which are
dynamically integrated at different stages of denoising. Furthermore, we also
leverage the generative priors of foundation models such as Stable Diffusion.
Extensive experiments on domain-specific datasets demonstrate that our method
achieves outstanding SR results, outperforming in many cases existing
approaches in perceptual quality and standard evaluation metrics. The code will
be available after the revision process.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces resqu, a super-resolution framework using quaternion wavelet-conditioned diffusion models, leveraging stable diffusion priors to enhance perceptual quality and structural fidelity. it claims superior performance in domain-specific datasets.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种名为resqu的超分辨率框架，该框架采用四元数小波条件扩散模型，并利用stable diffusion先验来增强感知质量和结构保真度。作者声称该方法在特定领域的数据集上表现优异。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(7/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00334v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Luigi Sigillo, Christian Bianchi, Danilo Comminiello</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.55, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Multimodal Masked Autoencoder Pre-training for 3D MRI-Based Brain Tumor Analysis with Missing Modalities</h2>
            <p class="paper-summary">Multimodal magnetic resonance imaging (MRI) constitutes the first line of
investigation for clinicians in the care of brain tumors, providing crucial
insights for surgery planning, treatment monitoring, and biomarker
identification. Pre-training on large datasets has been shown to help models
learn transferable representations and adapt with minimal labeled data. This
behavior is especially valuable in medical imaging, where annotations are often
scarce. However, applying this paradigm to multimodal medical data introduces a
challenge: most existing approaches assume that all imaging modalities are
available during both pre-training and fine-tuning. In practice, missing
modalities often occur due to acquisition issues, specialist unavailability, or
specific experimental designs on small in-house datasets. Consequently, a
common approach involves training a separate model for each desired modality
combination, making the process both resource-intensive and impractical for
clinical use. Therefore, we introduce BM-MAE, a masked image modeling
pre-training strategy tailored for multimodal MRI data. The same pre-trained
model seamlessly adapts to any combination of available modalities, extracting
rich representations that capture both intra- and inter-modal information. This
allows fine-tuning on any subset of modalities without requiring architectural
changes, while still benefiting from a model pre-trained on the full set of
modalities. Extensive experiments show that the proposed pre-training strategy
outperforms or remains competitive with baselines that require separate
pre-training for each modality subset, while substantially surpassing training
from scratch on several downstream tasks. Additionally, it can quickly and
efficiently reconstruct missing modalities, highlighting its practical value.
Code and trained models are available at: https://github.com/Lucas-rbnt/bmmae</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces bm-mae, a masked autoencoder pre-training strategy for 3d mri brain tumor analysis that addresses the challenge of missing modalities, allowing for seamless adaptation and improved performance compared to modality-specific pre-training.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了bm-mae，一种用于3d mri脑肿瘤分析的掩码自动编码器预训练策略，它解决了缺失模态的挑战，与模态特定的预训练相比，实现了无缝适应和性能提升。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00568v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Lucas Robinet, Ahmad Berjaoui, Elizabeth Cohen-Jonathan Moyal</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.6000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Neuroevolution of Self-Attention Over Proto-Objects</h2>
            <p class="paper-summary">Proto-objects - image regions that share common visual properties - offer a
promising alternative to traditional attention mechanisms based on
rectangular-shaped image patches in neural networks. Although previous work
demonstrated that evolving a patch-based hard-attention module alongside a
controller network could achieve state-of-the-art performance in visual
reinforcement learning tasks, our approach leverages image segmentation to work
with higher-level features. By operating on proto-objects rather than fixed
patches, we significantly reduce the representational complexity: each image
decomposes into fewer proto-objects than regular patches, and each proto-object
can be efficiently encoded as a compact feature vector. This enables a
substantially smaller self-attention module that processes richer semantic
information. Our experiments demonstrate that this proto-object-based approach
matches or exceeds the state-of-the-art performance of patch-based
implementations with 62% fewer parameters and 2.6 times less training time.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a neuroevolution approach to self-attention using proto-objects (image segments) instead of patches, resulting in a more efficient self-attention module with comparable or better performance.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 本文介绍了一种神经进化方法，通过使用原始对象（图像分割）代替补丁来实现自注意力机制，从而产生更高效的自注意力模块，并实现相当或更好的性能。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(3/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i>
                    <span class="text-xs text-gray-500 ml-1">(9/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star-half-alt"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(5/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00186v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Rafael C. Pinto, Anderson R. Tavares</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.65, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Uncertainty-Aware Multi-Expert Knowledge Distillation for Imbalanced Disease Grading</h2>
            <p class="paper-summary">Automatic disease image grading is a significant application of artificial
intelligence for healthcare, enabling faster and more accurate patient
assessments. However, domain shifts, which are exacerbated by data imbalance,
introduce bias into the model, posing deployment difficulties in clinical
applications. To address this problem, we propose a novel
<b>U</b>ncertainty-aware <b>M</b>ulti-expert <b>K</b>nowledge
<b>D</b>istillation (UMKD) framework to transfer knowledge from multiple
expert models to a single student model. Specifically, to extract
discriminative features, UMKD decouples task-agnostic and task-specific
features with shallow and compact feature alignment in the feature space. At
the output space, an uncertainty-aware decoupled distillation (UDD) mechanism
dynamically adjusts knowledge transfer weights based on expert model
uncertainties, ensuring robust and reliable distillation. UMKD also
tackles the problems of model architecture heterogeneity and distribution
discrepancies between source and target domains, which previous KD
approaches address inadequately. Extensive experiments on histology prostate grading
(<i>SICAPv2</i>) and fundus image grading (<i>APTOS</i>) demonstrate that
UMKD achieves a new state-of-the-art in both source-imbalanced and
target-imbalanced scenarios, offering a robust and practical solution for
real-world disease image grading.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces an uncertainty-aware multi-expert knowledge distillation (umkd) framework to address data imbalance and domain shift in disease image grading, achieving state-of-the-art results on histology and fundus image grading datasets.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种不确定性感知的多专家知识蒸馏（umkd）框架，以解决疾病图像分级中的数据不平衡和领域转移问题，并在组织学和眼底图像分级数据集上取得了最新的成果。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(2/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00592v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Shuo Tong, Shangde Gao, Ke Liu, Zihang Huang, Hongxia Xu, Haochao Ying, Jian Wu</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.7000000000000001, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">Towards Lightweight Hyperspectral Image Super-Resolution with Depthwise Separable Dilated Convolutional Network</h2>
            <p class="paper-summary">Deep neural networks have demonstrated highly competitive performance in
super-resolution (SR) for natural images by learning mappings from
low-resolution (LR) to high-resolution (HR) images. However, hyperspectral
super-resolution remains an ill-posed problem due to the high spectral
dimensionality of the data and the scarcity of available training samples.
Moreover, existing methods often rely on large models with a high number of
parameters or require the fusion with panchromatic or RGB images, both of which
are often impractical in real-world scenarios. Inspired by the MobileNet
architecture, we introduce a lightweight depthwise separable dilated
convolutional network (DSDCN) to address the aforementioned challenges.
Specifically, our model leverages multiple depthwise separable convolutions,
similar to the MobileNet architecture, and further incorporates a dilated
convolution fusion block to make the model more flexible for the extraction of
both spatial and spectral features. In addition, we propose a custom loss
function that combines mean squared error (MSE), an L2 norm
regularization-based constraint, and a spectral angle-based loss, ensuring the
preservation of both spectral and spatial details. The proposed model achieves
very competitive performance on two publicly available hyperspectral datasets,
making it well-suited for hyperspectral image super-resolution tasks. The
source code is publicly available at:
<a href="https://github.com/Usman1021/lightweight" target="_blank">https://github.com/Usman1021/lightweight</a>.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: this paper introduces a lightweight depthwise separable dilated convolutional network (dsdcn) for hyperspectral image super-resolution, addressing limitations of existing methods in terms of model size and reliance on additional data sources.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了一种轻量级的深度可分离扩张卷积网络(dsdcn)，用于高光谱图像超分辨率，解决了现有方法在模型大小和依赖额外数据源方面的局限性。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(2/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(5/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00374v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Usman Muhammad, Jorma Laaksonen, Lyudmila Mihaylova</p>
            
        </motion.div>
        
        <motion.div
            initial="{ opacity: 0, y: 50, scale: 0.9 }"
            whileInView="{ opacity: 1, y: 0, scale: 1 }"
            viewport="{ once: true, amount: 0.2 }" /* Trigger when 20% is visible */
            transition="{ duration: 0.5, delay: 0.75, ease: 'easeOut' }"  
            class="bento-item"
            data-motion-element
        >
            <h2 class="paper-title">AdCare-VLM: Leveraging Large Vision Language Model (LVLM) to Monitor Long-Term Medication Adherence and Care</h2>
            <p class="paper-summary">Chronic diseases, including diabetes, hypertension, asthma, HIV-AIDS,
epilepsy, and tuberculosis, necessitate rigorous adherence to medication to
avert disease progression, manage symptoms, and decrease mortality rates.
Adherence is frequently undermined by factors including patient behavior,
caregiver support, elevated medical costs, and insufficient healthcare
infrastructure. We propose AdCare-VLM, a specialized Video-LLaVA-based
multimodal large vision language model (LVLM) aimed at visual question
answering (VQA) concerning medication adherence through patient videos. We
employ a private dataset comprising 806 custom-annotated tuberculosis (TB)
medication monitoring videos, which have been labeled by clinical experts, to
fine-tune the model for adherence pattern detection. We present LLM-TB-VQA, a
detailed medical adherence VQA dataset that encompasses positive, negative, and
ambiguous adherence cases. Our method identifies correlations between visual
features, such as the clear visibility of the patient's face, medication, water
intake, and the act of ingestion, and their associated medical concepts in
captions. This facilitates the integration of aligned visual-linguistic
representations and improves multimodal interactions. Experimental results
indicate that our method surpasses parameter-efficient fine-tuning
(PEFT)-enabled VLMs, such as LLaVA-V1.5 and Chat-UniVi, with absolute
improvements ranging from 3.1% to 3.54% across pre-trained, regular, and
low-rank adaptation (LoRA) configurations. Comprehensive ablation studies and
attention map visualizations substantiate our approach, enhancing
interpretability.</p>
            
            <p class="paper-tldr"><strong>TLDR</strong>: the paper introduces adcare-vlm, a video-llava-based lvlm fine-tuned on a custom tb medication monitoring video dataset for vqa concerning medication adherence, showing improvements over existing peft-enabled vlm models.</p>
            
            
            <p class="paper-tldr"><strong>TLDR</strong>: 该论文介绍了adcare-vlm，一个基于video-llava的lvlm，通过在一个定制的结核病药物监测视频数据集上进行微调，用于关于药物依从性的vqa，并且相较于现有的peft驱动的vlm模型，性能有所提升。</p>
            

            
            
            <div class="paper-sub-ratings" style="display: flex; flex-wrap: wrap; gap: 10px; margin-bottom: 5px; font-size: 0.8em;">
                
                <div class="rating-item">
                    <span class="rating-label">Relevance:</span>
                    
                    <i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(2/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Novelty:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star-half-alt"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(7/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Clarity:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(8/10)</span>
                </div>
                
                
                <div class="rating-item">
                    <span class="rating-label">Potential Impact:</span>
                    
                    <i class="fas fa-star"></i><i class="fas fa-star"></i><i class="fas fa-star"></i><i class="far fa-star"></i><i class="far fa-star"></i>
                    <span class="text-xs text-gray-500 ml-1">(6/10)</span>
                </div>
                
            </div>
            
            

            
            <div class="paper-rating">
                <span class="rating-label" style="color: #000; font-weight: bold;">Overall:</span>
                
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="fas fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                    
                        <i class="far fa-star"></i>
                    
                
                <span class="text-xs text-gray-500 ml-1">(4/10)</span>
            </div>
            

            <a href="http://arxiv.org/abs/2505.00275v1" target="_blank" class="paper-link">
                <i class="fas fa-file-pdf mr-1"></i> Read Paper (PDF)
            </a>
            
            <p class="paper-authors">Authors: Md Asaduzzaman Jabin, Hanqi Jiang, Yiwei Li, Patrick Kaggwa, Eugene Douglass, Juliet N. Sekandi, Tianming Liu</p>
            
        </motion.div>
        
    </div>

    <footer class="footer">
        Generated on 2025-05-04 04:29:33 UTC. Powered by <a href="https://github.com/onion-liu" target="_blank">onion-liu</a>.
    </footer>

</body>
</html>